Goals:

Expose you to some foundational techniques & vocab for analyzing text. There is a focus on “bag of words” techniques rather than diving into deep learning techniques (foundations first). Brevity was valued over agonizing detail.

# data manipulation
library(dplyr)

# text stuff
library(tidytext)
library(SnowballC)
library(textstem)
library(textdata)

# visuals
library(wordcloud2)
library(ggplot2)

We’ll be leveraging the tidytext package due to its approachability and focus on the data.frame as the main object type for analysis.

The package has an accompanying book for free online here: https://www.tidytextmining.com/.

The data being used is every line in the script from the TV show: The Office. The data was sourced from this post on Reddit, and can be found directly in this google sheet.

First vocab words!!

  • A corpus is the full collection of text that you’re analyzing
  • A document is the buckets of text within the corpus
    • Here a document would be a line of dialogue; although we use the word “document”, it has a flexible definition depending on what naturally groups your text together in the corpus.
office <- read.csv("the_office_script.csv")
dim(office)
## [1] 59909     7
names(office)
## [1] "id"        "season"    "episode"   "scene"     "line_text" "speaker"  
## [7] "deleted"

0.1 Text as a “bag of words”

Many traditional methods for text analytics treat text as an unordered collection of words. This leads to many techniques that boil down to counting how many times different words occur. Many more modern methods take more advantage of neural networks and word embeddings to focus more on a word’s context.

However, bag of words techniques are still worthwhile due to them being:

  • foundational learning to working with text
  • great cost-benefit for many applications’ needs

0.2 “Tokenizing”

The unit of analysis in a bag of words approach is a single word or a “token”.

tidytext’s approach to this is the unnest_tokens function. It will break up each document into it’s individual tokens while still keeping useful ID information in the rows.

office_tokens <- office %>%
  unnest_tokens(word, line_text)

office_tokens %>%
  select(speaker, word) %>%
  head()
##   speaker        word
## 1 Michael         all
## 2 Michael       right
## 3 Michael         jim
## 4 Michael        your
## 5 Michael quarterlies
## 6 Michael        look

We can now start to do different counting tasks using typical R data.frame methods.

Most commonly occurring words shown below. Ahh the insights!

top_tokens <- office_tokens %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  arrange(-count)

top_tokens %>%
  head()
## # A tibble: 6 × 2
##   word  count
##   <chr> <int>
## 1 i     23599
## 2 you   22173
## 3 the   17722
## 4 to    16637
## 5 a     15218
## 6 and   11158

0.3 “Stop words”

In the past section we tried to do an analysis of the most common words and it turns out the most common words are boring… Luckily we’re not the first ones to come across this issue. These are referred to as stop words, there are different lists we can use to remove these. The tidytext package provides a stop_words data.frame that holds 3 separate lists; we can either choose one or just use them all.

# stop words per list
stop_words %>%
  group_by(lexicon) %>%
  summarise(count = n()) %>%
  arrange(-count)
## # A tibble: 3 × 2
##   lexicon  count
##   <chr>    <int>
## 1 SMART      571
## 2 onix       404
## 3 snowball   174
# example of some stopwords
stop_words[sample(nrow(stop_words), size = 8), ]
## # A tibble: 8 × 2
##   word     lexicon
##   <chr>    <chr>  
## 1 said     SMART  
## 2 very     onix   
## 3 went     SMART  
## 4 contains SMART  
## 5 far      onix   
## 6 backed   onix   
## 7 while    onix   
## 8 seemed   SMART

To remove these words from our tokens we can do some filtering.

filtered_top_tokens <- top_tokens %>%
  anti_join(stop_words, by = "word")

paste(nrow(top_tokens) - nrow(filtered_top_tokens), "instances of stop words removed")
## [1] "675 instances of stop words removed"
wordcloud2(filtered_top_tokens)

There’s still some words that you might not find too valuable for this analysis. Keep in mind that you aren’t limited to using the words in a pre-definded list. There might be some times where building an industry specific set of stop words might make sense.

0.3.1 Practice!

Practice tokenizing text and data manipulation by:

  • Finding the top speaker in the office by number of lines
  • Finding the top speaker in the office by number of words said
  • Does this differ?
  • Which speaker has the longest lines on average?

Practice stop word removal by:

  • Finding the top speaker in the office by number of words said, but be sure to exclude stop words
  • Make a word cloud of 1 or more of the characters, but be sure to exclude stop words

0.4 Stemming & lemmatization

0.4.1 Stemming

Sometimes we don’t want to differentiate between different tenses/usages of the same word; for example, below the word “stop” is used in 4 different ways. Depending on the insights you want to find you might want to keep these separate or you might to combine them into just one “stop” category.

Chopping of the endings of these words will leave you with the stem of “stop”.

stops <- top_tokens %>% 
  filter(grepl("stop", word)) %>% 
  head(4)

stops
## # A tibble: 4 × 2
##   word     count
##   <chr>    <int>
## 1 stop       614
## 2 stops       45
## 3 stopped     35
## 4 stopping    20

Below we apply the wordStem() function from the SnowballC package. This function leaves us with just the stem of each word.

stops %>% 
  mutate(stem = wordStem(word))
## # A tibble: 4 × 3
##   word     count stem 
##   <chr>    <int> <chr>
## 1 stop       614 stop 
## 2 stops       45 stop 
## 3 stopped     35 stop 
## 4 stopping    20 stop

If we repeat the same style word counting analysis now we might get some different results (might not… :shrug:).

In the below cell the complete analysis is restarted from scratch. Tokenize -> remove stop words -> stem -> aggregate. I’ve added the example_word column that shows what the word might have looked like before stemming (just 1 of the many potential words). In some cases it can be tough to see what the word was before stemming (eg see “hei” below which was originally “hey” or “gui” which was originally “guy”)

office %>% 
  unnest_tokens(word, line_text) %>% 
  anti_join(stop_words, by = "word") %>% 
  mutate(stem = wordStem(word)) %>% 
  group_by(stem) %>% 
  summarise(count = n(), example_word = first(word)) %>% 
  arrange(-count)
## # A tibble: 14,949 × 3
##    stem    count example_word
##    <chr>   <int> <chr>       
##  1 yeah     3227 yeah        
##  2 michael  2657 michael     
##  3 hei      2420 hey         
##  4 dwight   2144 dwight      
##  5 jim      1911 jim         
##  6 pam      1629 pam         
##  7 uh       1624 uh          
##  8 gui      1616 guys        
##  9 gonna    1555 gonna       
## 10 time     1496 times       
## # … with 14,939 more rows

0.4.2 Lemmatization

Word stemming is a rule based approach to finding word stems: remove ‘s’ from end of word, remove ‘ing’ from end of word, etc. This has the pro of being flexible to unique words that might not even appear on urbandictionary. This has the con of missing words that don’t follow the typical rules for changing tense.

The 2 below cases can explain this point more directly.

stops <- c("stops", "stopping", "stopped")
swims <- c("swim", "swam", "swum")

Word stemming can do well when removing endings is the right move, but it will fail on words that follow patterns that don’t concern the word ending.

wordStem(stops)
## [1] "stop" "stop" "stop"
wordStem(swims)
## [1] "swim" "swam" "swum"

A different but like minded process of finding the root of the token is “lemmatization”. We want to find the “lemma” of each input word. This is shown below to work on both of our example cases. This has the opposite pros/cons of stemming. Lemmatization relies on a dictionary based approach, so if there is unusual vocab in your corpus it might not be very effective.

lemmatize_words(stops)
## [1] "stop" "stop" "stop"
lemmatize_words(swims)
## [1] "swim" "swim" "swim"

PS no one is stopping you from doing both stemming and lemmatization. If you apply both, definitely apply lemmatization before stemming. Lemmatization always outputs a valid English word, stemming can make some words unrecognizable (which would hurt lemmatizing).

0.5 Sentiment analysis with dictionaries

“Sentiment analysis” is a term used where we try and describe text based on what feelings it emits. This can be feelings of “positive” vs “negative” or it can get more into nuanced sentiments like “anger”, “fear”, “trust”.

One way to do sentiment analysis is with a dictionary based approach. This has the typical limitations of dictionary based approaches: it only works if the word is in your dictionary.

The tidytext package provides routes to multiple sentiment dictionaries:

  • “afinn” - provides a word and its score. the score’s sign (pos or neg) indicates a positive or negative sentiment. the score’s magnitude indicates the strength of the sentiment

  • “bing” - provides a word and a label of the word as positive or negative

  • “nrc” - provides a word and a label of the word’s sentiment. provides a diverse set of sentiment labels (ie sadness, anger, etc)

  • “loughran” - provides a word and a label of the word’s sentiment. Developed from financial reports so this is a powerful or useless dictionary depending on the context: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”

Example using afinn

afinn <- get_sentiments("afinn")

sum_sentiment_by_char <- office_tokens %>% 
  left_join(afinn, by = "word") %>% 
  group_by(speaker) %>% 
  summarise(sentiment = sum(value, na.rm = TRUE))

# Most negative speakers
sum_sentiment_by_char %>% 
  arrange(sentiment) %>% 
  head(3)
## # A tibble: 3 × 2
##   speaker                       sentiment
##   <chr>                             <dbl>
## 1 Vance Refrigeration Worker #2       -35
## 2 Devon                               -18
## 3 Teddy                               -14
# Most positive speakers
sum_sentiment_by_char %>% 
  arrange(-sentiment) %>% 
  head()
## # A tibble: 6 × 2
##   speaker sentiment
##   <chr>       <dbl>
## 1 Michael      8869
## 2 Jim          3962
## 3 Pam          3420
## 4 Andy         2559
## 5 Dwight       2444
## 6 Erin          820

Example using bing

nrc <- get_sentiments("nrc")

sum_sentiment_by_char <- office_tokens %>% 
  left_join(nrc, by = "word") %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(!is.na(sentiment)) %>% 
  group_by(speaker, sentiment) %>% 
  summarise(count = n(), example_word = first(word)) %>% 
  arrange(-count)
## `summarise()` has grouped output by 'speaker'. You can override using the
## `.groups` argument.
# Most common speaker x sentiment pairs
sum_sentiment_by_char %>% 
  head()
## # A tibble: 6 × 4
## # Groups:   speaker [2]
##   speaker sentiment    count example_word
##   <chr>   <chr>        <int> <chr>       
## 1 Michael positive      6092 library     
## 2 Michael negative      4151 mistake     
## 3 Michael trust         3352 guidance    
## 4 Michael anticipation  3346 deal        
## 5 Dwight  positive      3318 word        
## 6 Michael joy           3276 deal

A limitation of this dictionary approach is only considering a single word at time. For example, “good” can express positive sentiment, but what if I had “not” in front of it or “very”? Some methods try and look into this idea by negating or boosting the score of the following word. However it turns out that’s not enough since language can be very nuanced, for example “I just got over having the flu, what a great experience that was…”

Due to these limitations, more modern/cutting edge sentiment methods utilize deep learning. This isn’t to say that the dictionary methods are totally useless. If you know your corpus and have some expectations of the results (and follow-up analysis on results) you’ll be able to tell if the dictionary approach isn’t cutting it for your application.

0.5.1 Practice!

Practice tokenizing text and data manipulation by:

  • Finding the top sentiment per season (using any dictionary but afinn)
  • Finding the most positive & negative episodes (using afinn dictionary)

0.6 TF-IDF

Let’s revisit the motivation first using stop words. We like removing stopwords because they are so common that they don’t provide value, but general stopwords might not cut it. For example, in the world of The Office they work at the a paper company, maybe we want to consider “paper” a stop word.

Instead of always building a stop word dictionary, we can try and put a number to the value of a word. This is the goal of TF-IDF

TF stands for “term frequency” - its a measure of how often a word appears in a document IDF stands for “inverse document frequency” - its a measure of how many documents a word appears in

  • Words that appear a lot in many documents are going to essentially be like stopwords and have a low TFIDF.
  • Words that appear infrequently aren’t valued by TFIDF and will have a low score
  • Words that appear a lot but just in 1 document are likely words that are important to just that document and they’ll have a high TFIDF.

The below analysis removes stopwords and then calculates the tfidf of words by each office character. It turns out it’s a way to find the people they talk about the most (and often turns out to be a love interest finder per person).

# treat each speaker as a "document"
# what are the highest tf-idf words per speaker
tfidf_by_speaker <- office_tokens %>% 
  anti_join(stop_words, by = "word") %>% 
  group_by(word, speaker) %>% 
  summarise(n = n()) %>% 
  bind_tf_idf(word, speaker, n) %>% 
  arrange(-tf_idf)
## `summarise()` has grouped output by 'word'. You can override using the
## `.groups` argument.
tfidf_by_speaker %>% 
  filter(n > 50) %>% 
  head(8)
## # A tibble: 8 × 6
## # Groups:   word [6]
##   word    speaker     n     tf   idf tf_idf
##   <chr>   <chr>   <int>  <dbl> <dbl>  <dbl>
## 1 michael Jan       204 0.0720  1.76 0.126 
## 2 michael David      75 0.0603  1.76 0.106 
## 3 andy    Erin      126 0.0265  2.43 0.0643
## 4 michael Holly      52 0.0326  1.76 0.0573
## 5 ryan    Kelly      66 0.0197  2.83 0.0557
## 6 god     Kelly      61 0.0182  2.47 0.0451
## 7 uh      Toby       80 0.0255  1.76 0.0450
## 8 dwight  Angela    112 0.0221  2.03 0.0448

0.6.1 Practice!

Practice TF-IDF analysis by:

  • Finding the top TF-IDF words per season
  • Finding the top TF-IDF words per episode
  • Create a visual to display one of these (or by character)
    • See here for an example of creating TF-IDF barcharts with the tidytext and ggplot2 packages